We did scRNAseq (10X version 3) of mouse endochondral bones cell populations (in a balanced/even representation way, by FACS isolation) in perinatal stages E18 (just before birth) and postnatal day 1 (PN1).
The files used as input for this analysis are the files coming out of the the cellranger count pipeline.
In this case I used cellranger v6.0.0.
library(Matrix)
library(dplyr)
library(Seurat)
library(patchwork)
library(ggplot2)
library(DoubletFinder)
library(BiocStyle)
library(dittoSeq)
library(RColorBrewer)
ggplotColours <- function(n = 6, h = c(0, 360) + 15){
if ((diff(h) %% 360) < 1) h[2] <- h[2] - 360/n
hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)
}
Seurat is also able to analyze data from multimodal 10X experiments processed using CellRanger v3.
I first read all the data and then create a seurat object with the gene expression, and append the antibody and multiplex assays
# Load the Mo_PN1_MesCells dataset
cluster_dir <- "~/scratch/stromal_cells/1-cellranger_mapping/Mo_PN1_MesCells/Mo_PN1_MesCells"
Mo_PN1_MesCells.data <- Read10X(data.dir = paste0(cluster_dir,"/outs/filtered_feature_bc_matrix/"))
# Initialize the Seurat object with the raw (non-normalized data).
Mo_PN1_MesCells <- CreateSeuratObject(counts = Mo_PN1_MesCells.data,
project = "Mo_PN1_MesCells",
min.cells = 3, min.features = 200)
rm(Mo_PN1_MesCells.data)
saveRDS(Mo_PN1_MesCells, "Mo_PN1_MesCells_init.rds")
Mo_PN1_MesCells <- readRDS("Mo_PN1_MesCells_init.rds")
dim(Mo_PN1_MesCells)
## [1] 19962 8045
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include
The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes Cell doublets or multiplets may exhibit an aberrantly high gene count Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes) The percentage of reads that map to the mitochondrial genome Low-quality / dying cells often exhibit extensive mitochondrial contamination We calculate mitochondrial QC metrics with the PercentageFeatureSet function, which calculates the percentage of counts originating from a set of features We use the set of all genes starting with MT- as a set of mitochondrial genes
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
Mo_PN1_MesCells[["percent.mt"]] <- PercentageFeatureSet(Mo_PN1_MesCells, pattern = "^mt-")
# Show QC metrics for the first 5 cells
head(Mo_PN1_MesCells@meta.data, 10)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCCAAGCCTGGAA-1 Mo_PN1_MesCells 14819 4114 1.9164586
## AAACCCAAGCGTGAAC-1 Mo_PN1_MesCells 1404 619 17.0940171
## AAACCCACAACCCTCT-1 Mo_PN1_MesCells 1906 803 0.1049318
## AAACCCACAAGAGCTG-1 Mo_PN1_MesCells 19092 4461 2.7341295
## AAACCCACATCGATGT-1 Mo_PN1_MesCells 2391 996 0.4182350
## AAACCCAGTGAACGGT-1 Mo_PN1_MesCells 5874 2260 0.9703779
## AAACCCATCATGCCGG-1 Mo_PN1_MesCells 8852 2713 2.9710800
## AAACCCATCTGATGGT-1 Mo_PN1_MesCells 7400 1064 0.4459459
## AAACCCATCTTACTGT-1 Mo_PN1_MesCells 7739 2687 1.4342938
## AAACGAAAGAGAGGTA-1 Mo_PN1_MesCells 10490 2841 1.9542421
In the example below, we visualize QC metrics, and use these to filter cells.
We filter cells that have unique feature counts over 2,500 or less than 200 We filter cells that have >5% mitochondrial counts
# Visualize QC metrics as a violin plot
VlnPlot(Mo_PN1_MesCells, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
group.by = "orig.ident", ncol = 3, pt.size = 0)
plot(x = Mo_PN1_MesCells@meta.data$nCount_RNA,log = "y",
xlab = "RNA counts",ylab = "Features",
y = Mo_PN1_MesCells@meta.data$nFeature_RNA,
type = "p",pch=20 , cex= 0.2, col="darkblue")
abline(h = c(400,6000), lty=2)
abline(v = c(1000), lty=2)
plot(x = Mo_PN1_MesCells@meta.data$nFeature_RNA,
xlab = "Features",ylab = "perc.mit",
y = Mo_PN1_MesCells@meta.data$percent.mt,
type = "p",pch=20 , cex= 0.2,#xlim = c(0,35000),
col="darkblue")
abline(v = c(400,6000), lty=2)
abline(h = c(10), lty=2)
Mo_PN1_MesCells <- subset(Mo_PN1_MesCells, subset = nFeature_RNA > 400 & nFeature_RNA < 6000 & percent.mt < 10)
dim(Mo_PN1_MesCells)
## [1] 19962 8045
Mo_PN1_MesCells <- readRDS("Mo_PN1_MesCells.rds")
dim(Mo_PN1_MesCells)
## [1] 19962 7714
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in Mo_PN1_MesCells[[“RNA”]]@data.
Mo_PN1_MesCells <- NormalizeData(Mo_PN1_MesCells, normalization.method = "LogNormalize",
scale.factor = 10000)
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
Our procedure in Seurat3 is described in detail here, and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
Mo_PN1_MesCells <- FindVariableFeatures(Mo_PN1_MesCells, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(Mo_PN1_MesCells), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(Mo_PN1_MesCells)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function:
Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate The results of this are stored in Mo_PN1_MesCells[[“RNA”]]@scale.data
all.genes <- rownames(Mo_PN1_MesCells)
Mo_PN1_MesCells <- ScaleData(Mo_PN1_MesCells, features = all.genes)
Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset.
Mo_PN1_MesCells <- RunPCA(Mo_PN1_MesCells, features = VariableFeatures(object = Mo_PN1_MesCells))
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction, DimPlot, and DimHeatmap
# Examine and visualize PCA results a few different ways
print(Mo_PN1_MesCells[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: Sparc, Serpinh1, Col1a2, Col1a1, Col6a3
## Negative: Coro1a, Cd52, Fcer1g, Srgn, Rac2
## PC_ 2
## Positive: Stfa2, Retnlg, Asprv1, Cstdc5, Cstdc4
## Negative: Birc5, Top2a, Cdca8, Pclaf, Stmn1
## PC_ 3
## Positive: Cdh5, Esam, Emcn, Adgrf5, Egfl7
## Negative: Col6a3, Nusap1, Col1a2, Top2a, Col1a1
## PC_ 4
## Positive: Ctss, Ms4a6c, Ms4a6d, Ms4a6b, Ifi207
## Negative: Stfa2, Cd24a, Cstdc5, Retnlg, Pglyrp1
## PC_ 5
## Positive: Anxa8, Fam180a, Cd109, Aqp1, Gm17455
## Negative: Agtr2, Cdkn1c, Smoc2, Lpl, Zim1
VizDimLoadings(Mo_PN1_MesCells, dims = 1:2, reduction = "pca")
DimPlot(Mo_PN1_MesCells,group.by = "orig.ident", reduction = "pca")
DimHeatmap(Mo_PN1_MesCells, dims = 1:9, cells = 500, balanced = TRUE)
ElbowPlot(Mo_PN1_MesCells, ndims = 50)
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, the approach to partioning the cellular distance matrix into clusters has dramatically improved. The approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors function, and takes as input the previously defined dimensionality of the dataset (first 25 PCs).
To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents function.
Mo_PN1_MesCells <- FindNeighbors(Mo_PN1_MesCells, dims = 1:25)
Mo_PN1_MesCells <- FindClusters(Mo_PN1_MesCells, resolution = 0.2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7714
## Number of edges: 264898
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9510
## Number of communities: 18
## Elapsed time: 0 seconds
Mo_PN1_MesCells <- FindClusters(Mo_PN1_MesCells, resolution = 0.3)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7714
## Number of edges: 264898
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9380
## Number of communities: 19
## Elapsed time: 0 seconds
Mo_PN1_MesCells <- FindClusters(Mo_PN1_MesCells, resolution = 0.1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7714
## Number of edges: 264898
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9693
## Number of communities: 12
## Elapsed time: 0 seconds
Mo_PN1_MesCells <- FindClusters(Mo_PN1_MesCells, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7714
## Number of edges: 264898
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9195
## Number of communities: 23
## Elapsed time: 0 seconds
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
Mo_PN1_MesCells <- RunUMAP(Mo_PN1_MesCells, dims = 1:25)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(Mo_PN1_MesCells, reduction = "umap", label = T, label.size = 5, group.by = "RNA_snn_res.0.2" )
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
FeaturePlot(Mo_PN1_MesCells, reduction = "umap",
features = c("nCount_RNA", "nFeature_RNA"))
I use here DoubletFinder for predcting doublets in the dataset
DoubletFinder can be broken up into 4 steps: (1) Generate artificial doublets from existing scRNA-seq data (2) Pre-process merged real-artificial data (3) Perform PCA and use the PC distance matrix to find each cell’s proportion of artificial k nearest neighbors (pANN) (4) Rank order and threshold pANN values according to the expected number of doublets
## pK Identification (no ground-truth) ---------------------------------------------------------------------------------------
sweep.res.list_Fpl <- paramSweep_v3(Mo_PN1_MesCells, PCs = 1:10, sct = FALSE)
sweep.stats_Fpl <- summarizeSweep(sweep.res.list_Fpl, GT = FALSE)
bcmvn_Fpl <- find.pK(sweep.stats_Fpl)
## Homotypic Doublet Proportion Estimate -------------------------------------------------------------------------------------
homotypic.prop <- modelHomotypic(Mo_PN1_MesCells@meta.data$RNA_snn_res.0.2) ## ex: annotations <- seu_kidney@meta.data$ClusteringResults
nExp_poi <- round(0.075*nrow(Mo_PN1_MesCells@meta.data)) ## Assuming 7.5% doublet formation rate - tailor for your dataset
nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop))
Mo_PN1_MesCells <- doubletFinder_v3(Mo_PN1_MesCells, PCs = 1:10,
pN = 0.25, pK = 0.09,
nExp = nExp_poi, reuse.pANN = FALSE,
sct = FALSE)
Mo_PN1_MesCells <- doubletFinder_v3(Mo_PN1_MesCells, PCs = 1:10,
pN = 0.25, pK = 0.09,
nExp = nExp_poi.adj,
reuse.pANN = "pANN_0.25_0.09_579",
sct = FALSE)
DimPlot(Mo_PN1_MesCells, group.by="DF.classifications_0.25_0.09_579",
cols = c("red","grey"),
reduction="umap", pt.size=0.3
)
saveRDS(Mo_PN1_MesCells, "Mo_PN1_MesCells.rds")
From Gretel:
There are some “pan” markers for the different compartments and clusters and highly expressed markers (altough not exclusive) for clusters already well defined in the literature, I send you some examples and explanations.
These clusters were defined following the unsupervised clustering first and redifined in more subclusters from our knowledge of the literature. Without doublets or dead cell filtration.
In the stromal compartment there are clusters differents from adult and we have many questions, the aim of the study.
Other markers are: CD52, CD53
FeaturePlot(Mo_PN1_MesCells,
features = c("Ptprc", # Leukocyte lineage
"Cd52", #
"Cd53") #
)
Is identified by Hbb-bt, Hba-a1, Hba-a2 (by FACS with TER119)
FeaturePlot(Mo_PN1_MesCells,
features = c("Hbb-bt", #
"Hba-a1", #
"Hba-a2") #
)
Tek, Tie1, Cdh5, Sox7, Pecam1, (Eng, CD200, Ly6a, not exclusives but highly expressed in these cells)
In perinatal is only one cluster but in adult 2 clusters are well defined: the arteriolar and the sinusoidal endothelial cells.
FeaturePlot(Mo_PN1_MesCells,
features = c("Tek", #
"Tie1", #
"Cdh5",
"Pecam1") #
)
Pan markers: Pdgfra, Prrx1, Col1a2,
FeaturePlot(Mo_PN1_MesCells,
features = c("Pdgfra", #
"Prrx1", #
"Col1a2") #
)
Committed lineages are easier to identify.
Sp7, Runx2, Bglap, Bglap2,Bglap3, Mmp13, Col13a, Ibsp, Col1a1 (in perinatal stages they are all together) but are mainly in two populations in adult (according to the literature are progenitors of the osteo lineage).
FeaturePlot(Mo_PN1_MesCells,
features = c("Sp7", #
"Mmp13", #
"Bglap",
"Bglap2")#
)
Sox9,Sox5,Sox6, Col2a1, Acan, Col9a3, Col9a1,Nt5e (In perinatal stages some of the markers are not exclusive but the combination of all define the cluster).
FeaturePlot(Mo_PN1_MesCells,order = T,
features = c("Sox9", #
"Sox5", #
"Sox6",
"Col2a1")#
)
Myofibroblasts: Pax7, Myf5, Myod1, Mrln, Msc
FeaturePlot(Mo_PN1_MesCells,order = T,
features = c("Pax7", #
"Myf5", #
"Myod1",
"Mrln")#
)
Other clusters in the bone
Neural cells: Schwann cells: Sox10, Mal, Plp1
FeaturePlot(Mo_PN1_MesCells,order = T,
features = c("Sox10", #
"Mal", #
"Plp1" )
)
DimPlot(Mo_PN1_MesCells, label = T, group.by = "RNA_snn_res.0.2" )
dittoBarPlot(object = Mo_PN1_MesCells,
var = "DF.classifications_0.25_0.09_579",
group.by = "RNA_snn_res.0.2")
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 dittoSeq_1.4.1 DoubletFinder_2.0.3 ggplot2_3.3.5 patchwork_1.1.1 SeuratObject_4.0.2 Seurat_4.0.3 dplyr_1.0.7 Matrix_1.3-4 BiocStyle_2.20.2
##
## loaded via a namespace (and not attached):
## [1] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2 splines_4.1.0 listenv_0.8.0 scattermore_0.7 GenomeInfoDb_1.28.1 digest_0.6.27 htmltools_0.5.2 magick_2.7.2 fansi_0.5.0 magrittr_2.0.1 tensor_1.5 cluster_2.1.2 ROCR_1.0-11 globals_0.14.0 matrixStats_0.61.0 spatstat.sparse_2.0-0 colorspace_2.0-2 ggrepel_0.9.1 xfun_0.24 crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2 spatstat.data_2.1-0 survival_3.2-11 zoo_1.8-9 glue_1.4.2 polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.38.0
## [32] XVector_0.32.0 leiden_0.3.8 DelayedArray_0.18.0 future.apply_1.7.0 SingleCellExperiment_1.14.1 BiocGenerics_0.38.0 abind_1.4-5 scales_1.1.1 pheatmap_1.0.12 DBI_1.1.1 miniUI_0.1.1.1 Rcpp_1.0.7 viridisLite_0.4.0 xtable_1.8-4 reticulate_1.20 spatstat.core_2.3-0 stats4_4.1.0 htmlwidgets_1.5.4 httr_1.4.2 ellipsis_0.3.2 ica_1.0-2 pkgconfig_2.0.3 farver_2.1.0 sass_0.4.0 uwot_0.1.10 deldir_0.2-10 utf8_1.2.2 tidyselect_1.1.1 labeling_0.4.2 rlang_0.4.11 reshape2_1.4.4
## [63] later_1.3.0 munsell_0.5.0 tools_4.1.0 generics_0.1.0 ggridges_0.5.3 evaluate_0.14 stringr_1.4.0 fastmap_1.1.0 yaml_2.2.1 goftest_1.2-2 knitr_1.33 fitdistrplus_1.1-5 purrr_0.3.4 RANN_2.6.1 pbapply_1.5-0 future_1.22.1 nlme_3.1-152 mime_0.11 compiler_4.1.0 plotly_4.9.4.1 png_0.1-7 spatstat.utils_2.2-0 tibble_3.1.4 bslib_0.3.0 stringi_1.7.4 highr_0.9 lattice_0.20-44 vctrs_0.3.8 pillar_1.6.2 lifecycle_1.0.0 BiocManager_1.30.16
## [94] spatstat.geom_2.2-2 lmtest_0.9-38 jquerylib_0.1.4 RcppAnnoy_0.0.19 data.table_1.14.0 cowplot_1.1.1 bitops_1.0-7 irlba_2.3.3 httpuv_1.6.3 GenomicRanges_1.44.0 R6_2.5.1 bookdown_0.22 promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3 IRanges_2.26.0 parallelly_1.28.1 codetools_0.2-18 MASS_7.3-54 assertthat_0.2.1 SummarizedExperiment_1.22.0 withr_2.4.2 sctransform_0.3.2 S4Vectors_0.30.0 GenomeInfoDbData_1.2.6 mgcv_1.8-35 parallel_4.1.0 grid_4.1.0 rpart_4.1-15 tidyr_1.1.3 rmarkdown_2.9
## [125] MatrixGenerics_1.4.3 Rtsne_0.15 Biobase_2.52.0 shiny_1.6.0